Abstract. Given a finite set of words $w_1, \ldots, w_n$ independently drawn according to a fixed unknown distribution law $P$ called a stochastic language, a usual goal in Grammatical Inference is to infer an estimate of $P$ in some class of probabilistic models, such as Probabilistic Automata (PA). Here, we study the class $\mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$ of rational stochastic languages, which consists of stochastic languages that can be generated by Multiplicity Automata (MA) and which strictly includes the class of stochastic languages generated by PA. Rational stochastic languages have a minimal normal representation which may be very concise, and whose parameters can be efficiently estimated from stochastic samples. We design an efficient inference algorithm DEES which aims at building a minimal normal representation of the target. Despite the fact that no recursively enumerable class of MA computes exactly $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$, we show that DEES strongly identifies $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ in the limit. We study the intermediary MA output by DEES and show that they compute rational series which converge absolutely to one and which can be used to provide stochastic languages which closely estimate the target.



1 Introduction 

In probabilistic grammatical inference, it is supposed that data arise in the form of a finite set of words $w_1, \ldots, w_n$, built on a predefined alphabet $\Sigma$, and independently drawn according to a fixed unknown distribution law on $\Sigma^*$ called a stochastic language. Then, a usual goal is to try to infer an estimate of this distribution law in some class of probabilistic models, such as Probabilistic Automata (PA), which have the same expressivity as Hidden Markov Models (HMM). PA are identifiable in the limit [6]. However, to our knowledge, there exists no efficient inference algorithm able to deal with the whole class of stochastic languages that can be generated from PA. Most of the previous works use
restricted subclasses of PA such as Probabilistic Deterministic Automata (PDA) [5, 12].
On the other hand, Probabilistic Automata are particular cases of Multiplicity Automata,
and stochastic languages which can be generated by multiplicity automata are special 
cases of rational languages that we call rational stochastic languages. MA have been 
used in grammatical inference in a variant of the exact learning model of Angluin [3, 1, 2]
but not in probabilistic grammatical inference. Let us denote by $\mathcal{S}_K^{rat}(\Sigma)$ the class of rational stochastic languages over the semiring $K$. When $K = \mathbb{Q}^+$ or $K = \mathbb{R}^+$, $\mathcal{S}_K^{rat}(\Sigma)$ is exactly the class of stochastic languages generated by PA with parameters in $K$. But, when $K = \mathbb{Q}$ or $K = \mathbb{R}$, we obtain strictly greater classes which provide several advantages and at least one drawback: elements of $\mathcal{S}_{K^+}^{rat}(\Sigma)$ may have significantly smaller representation in $\mathcal{S}_K^{rat}(\Sigma)$, which is clearly an advantage from a learning perspective; elements of $\mathcal{S}_K^{rat}(\Sigma)$ have a minimal normal representation while such normal representations do not exist for PA; parameters of these minimal representations are directly related to probabilities of some natural events of the form $u\Sigma^*$, which can be efficiently estimated from stochastic samples; lastly, when $K$ is a field, rational series over $K$ form a vector space and efficient linear algebra techniques can be used to deal with rational stochastic languages. However, the class $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ presents a serious drawback: there exists no recursively enumerable subset of MA which exactly generates it [6]. Moreover, this class of representations is unstable: arbitrarily close to an MA which generates a stochastic language, we may find MA whose associated rational series $r$ takes negative values and is not absolutely convergent: the global weight $\sum_{w \in \Sigma^*} r(w)$ may be unbounded or not (absolutely) defined. However, we show that $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ is strongly identifiable in the limit: we design an algorithm DEES which, for any target $P \in \mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ and given access to an infinite sample $S$ drawn according to $P$, converges in a finite but unbounded number of steps to a minimal normal representation of $P$. Moreover, DEES is efficient: it runs within polynomial time in the size of the input and it computes a minimal number of parameters with classical statistical rates of convergence. However, before converging to the target, DEES outputs MA which are close to the target but which do not compute stochastic languages. The question is: what kind of guarantees do we have on these intermediary hypotheses and how can we use them for a probabilistic inference purpose? We show that, since the algorithm aims at building a minimal normal representation of the target, the intermediary hypotheses $r$ output by DEES have a nice property: they absolutely converge to 1, i.e. $\overline{r} = \sum_{w \in \Sigma^*} |r(w)| < \infty$ and $\sum_{k \geq 0} r(\Sigma^k) = 1$. As a consequence, $r(X)$ is defined without ambiguity for any $X \subseteq \Sigma^*$, and it can be shown that $N_r = \sum_{r(u) < 0} |r(u)|$ tends to 0 as the learning
proceeds. Given any such series $r$, we can efficiently compute a stochastic language $p_r$,
which is not rational, but has the property that e Nr / r < p r (u) /r(u) < 1 for any word u 
such that $r(u) > 0$. Our conclusion is that, despite the fact that no recursively enumerable class of MA represents the class of rational stochastic languages, MA can be used efficiently to infer such stochastic languages.

Classical notions on stochastic languages, rational series, and multiplicity automata are recalled in Section 2. We study an example which shows that the representation of rational stochastic languages by MA with real parameters may be very concise. We introduce our inference algorithm DEES in Section 3 and we show that $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ is strongly identifiable in the limit. We study the properties of the MA output by DEES in Section 4 and we show that they define absolutely convergent rational series which can be used to compute stochastic languages which are estimates of the target.

2 Preliminaries 

Formal power series and stochastic languages. Let $\Sigma^*$ be the set of words on the finite alphabet $\Sigma$. The empty word is denoted by $\varepsilon$ and the length of a word $u$ is denoted by $|u|$. For any integer $k$, let $\Sigma^k = \{u \in \Sigma^* : |u| = k\}$ and $\Sigma^{\leq k} = \{u \in \Sigma^* : |u| \leq k\}$. We denote by $<$ the length-lexicographic order on $\Sigma^*$. A subset $P$ of $\Sigma^*$ is prefixial if for any $u, v \in \Sigma^*$, $uv \in P \Rightarrow u \in P$. For any $S \subseteq \Sigma^*$, let $pref(S) = \{u \in \Sigma^* : \exists v \in \Sigma^*, uv \in S\}$ and $fact(S) = \{v \in \Sigma^* : \exists u, w \in \Sigma^*, uvw \in S\}$.

Let $\Sigma$ be a finite alphabet and $K$ a semiring. A formal power series is a mapping $r$ of $\Sigma^*$ into $K$. In this paper, we always suppose that $K \in \{\mathbb{R}, \mathbb{Q}, \mathbb{R}^+, \mathbb{Q}^+\}$. The set of all formal power series is denoted by $K\langle\langle\Sigma\rangle\rangle$. Let us denote by $supp(r)$ the support of $r$, i.e. the set $\{w \in \Sigma^* : r(w) \neq 0\}$.



A stochastic language is a formal series $p$ which takes its values in $\mathbb{R}^+$ and such that $\sum_{w \in \Sigma^*} p(w) = 1$. For any language $L \subseteq \Sigma^*$, let us denote $\sum_{w \in L} p(w)$ by $p(L)$. The set of all stochastic languages over $\Sigma$ is denoted by $\mathcal{S}(\Sigma)$. For any stochastic language $p$ and any word $u$ such that $p(u\Sigma^*) \neq 0$, we define the stochastic language $u^{-1}p$ by $u^{-1}p(w) = p(uw)/p(u\Sigma^*)$; $u^{-1}p$ is called the residual language of $p$ wrt $u$. Let us denote by $res(p)$ the set $\{u \in \Sigma^* : p(u\Sigma^*) \neq 0\}$ and by $Res(p)$ the set $\{u^{-1}p : u \in res(p)\}$.
We call sample any finite sequence of words. Let $S$ be a sample. We denote by $P_S$ the empirical distribution on $\Sigma^*$ associated with $S$. A complete presentation of $P$ is an infinite sequence $S$ of words independently drawn according to $P$. We denote by $S_n$ the sequence composed of the $n$ first words of $S$. We shall make frequent use of the Borel-Cantelli Lemma, which states that if $(A_k)_{k \in \mathbb{N}}$ is a sequence of events such that $\sum_{k \in \mathbb{N}} Pr(A_k) < \infty$, then the probability that a finite number of $A_k$ occur is 1.
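As an illustration of the empirical quantities just defined, the following minimal sketch computes $P_S(u\Sigma^*)$ and the empirical residual $u^{-1}P_S$ from a sample. It assumes that words are Python strings whose characters are the letters of $\Sigma$; the function names and the toy sample are hypothetical and do not come from the paper.

```python
from collections import Counter

def prefix_prob(sample, u):
    """Empirical P_S(u Sigma^*): fraction of sample words that start with u."""
    return sum(1 for w in sample if w.startswith(u)) / len(sample)

def residual(sample, u):
    """Empirical residual u^{-1}P_S as a dict {w: P_S(uw) / P_S(u Sigma^*)}."""
    suffixes = Counter(w[len(u):] for w in sample if w.startswith(u))
    total = sum(suffixes.values())
    if total == 0:
        raise ValueError("u does not belong to res(P_S)")
    return {w: count / total for w, count in suffixes.items()}

# Hypothetical toy sample over the alphabet {a, b}; "" stands for the empty word.
sample = ["", "a", "ab", "ab", "b", "aab", "a", "ab"]
p_a = prefix_prob(sample, "a")   # six of the eight words start with "a"
res_a = residual(sample, "a")    # distribution of what follows the prefix "a"
```

By construction the values of `residual(sample, u)` sum to 1, so the empirical residual is itself a stochastic language with finite support.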

Automata. Let $K$ be a semiring. A $K$-multiplicity automaton (MA) is a 5-tuple $\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ where $Q$ is a finite set of states, $\varphi : Q \times \Sigma \times Q \to K$ is the transition function, $\iota : Q \to K$ is the initialization function and $\tau : Q \to K$ is the termination function. Let $Q_I = \{q \in Q \mid \iota(q) \neq 0\}$ be the set of initial states and $Q_T = \{q \in Q \mid \tau(q) \neq 0\}$ be the set of terminal states. The support of an MA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ is the NFA $supp(A) = \langle \Sigma, Q, Q_I, Q_T, \delta \rangle$ where $\delta(q, x) = \{q' \in Q \mid \varphi(q, x, q') \neq 0\}$. We extend the transition function $\varphi$ to $Q \times \Sigma^* \times Q$ by $\varphi(q, wx, r) = \sum_{s \in Q} \varphi(q, w, s)\varphi(s, x, r)$ and $\varphi(q, \varepsilon, r) = 1$ if $q = r$ and 0 otherwise, for any $q, r \in Q$, $x \in \Sigma$ and $w \in \Sigma^*$. For any finite subset $L \subseteq \Sigma^*$ and any $R \subseteq Q$, define $\varphi(q, L, R) = \sum_{w \in L, r \in R} \varphi(q, w, r)$.

For any MA $A$, let $r_A$ be the series defined by $r_A(w) = \sum_{q, r \in Q} \iota(q)\varphi(q, w, r)\tau(r)$. For any $q \in Q$, we define the series $r_{A,q}$ by $r_{A,q}(w) = \sum_{r \in Q} \varphi(q, w, r)\tau(r)$. A state $q \in Q$ is accessible (resp. co-accessible) if there exists $q_0 \in Q_I$ (resp. $q_t \in Q_T$) and $u \in \Sigma^*$ such that $\varphi(q_0, u, q) \neq 0$ (resp. $\varphi(q, u, q_t) \neq 0$). An MA is trimmed if all its states are accessible and co-accessible. From now on, we only consider trimmed MA.
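The series computed by an MA can be evaluated letter by letter with the recursive extension of the transition function. The sketch below is only an illustration under stated assumptions: the automaton is stored in plain dictionaries, and the two-state MA over the alphabet {a} is a hypothetical example, not one taken from the paper.

```python
def ma_weight(word, iota, tau, phi):
    """r_A(word) = sum over q, r of iota(q) * phi(q, word, r) * tau(r)."""
    states = list(iota)
    # forward[q] holds sum over q0 of iota(q0) * phi(q0, prefix read so far, q)
    forward = dict(iota)
    for x in word:
        forward = {
            r: sum(forward[s] * phi.get((s, x, r), 0.0) for s in states)
            for r in states
        }
    return sum(forward[q] * tau[q] for q in states)

# Hypothetical two-state MA over {a}; one transition weight is negative.
iota = {0: 1.0, 1: 0.0}
tau = {0: 0.5, 1: 0.5}
phi = {(0, "a", 0): 0.75, (0, "a", 1): -0.25, (1, "a", 1): 0.5}

weights = [ma_weight("a" * n, iota, tau, phi) for n in range(3)]
```

Each step costs a number of multiplications quadratic in the number of states, so evaluating a word is linear in its length; missing entries of `phi` are treated as weight 0.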

A Probabilistic Automaton (PA) is a trimmed MA $\langle \Sigma, Q, \varphi, \iota, \tau \rangle$ s.t. $\iota$, $\varphi$ and $\tau$ take their values in $[0, 1]$, such that $\sum_{q \in Q} \iota(q) = 1$ and for any state $q$, $\tau(q) + \varphi(q, \Sigma, Q) = 1$. Probabilistic automata generate stochastic languages. A Probabilistic Deterministic Automaton (PDA) is a PA whose support is deterministic.

For any class $C$ of multiplicity automata over $K$, let us denote by $\mathcal{S}_K^{C}(\Sigma)$ the class of all stochastic languages which are recognized by an element of $C$.

Rational series and rational stochastic languages. Rational series have several characterizations ([11, 4, 10]). Here, we shall say that a formal power series $r$ over $\Sigma$ is $K$-rational iff there exists a $K$-multiplicity automaton $A$ such that $r = r_A$, where $K \in \{\mathbb{R}, \mathbb{R}^+, \mathbb{Q}, \mathbb{Q}^+\}$. Let us denote by $K^{rat}\langle\langle\Sigma\rangle\rangle$ the set of $K$-rational series over $\Sigma$ and by $\mathcal{S}_K^{rat}(\Sigma) = K^{rat}\langle\langle\Sigma\rangle\rangle \cap \mathcal{S}(\Sigma)$ the set of rational stochastic languages over $K$. Rational stochastic languages have been studied in [7] from a language theoretical point of view. Inclusion relations between classes of rational stochastic languages are summarized in Fig. 1. It is worth noting that $\mathcal{S}_{\mathbb{R}}^{PDA}(\Sigma) \subsetneq \mathcal{S}_{\mathbb{R}}^{PA}(\Sigma) \subsetneq \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$.

Let $P$ be a rational stochastic language. The MA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ is a reduced representation of $P$ if (i) $P = P_A$, (ii) $\forall q \in Q$, $P_{A,q} \in \mathcal{S}(\Sigma)$ and (iii) the set $\{P_{A,q} : q \in Q\}$ is linearly independent. It can be shown that $Res(P)$ spans a finite dimensional vector subspace $[Res(P)]$ of $\mathbb{R}\langle\langle\Sigma\rangle\rangle$. Let $Q_P$ be the smallest subset of $res(P)$ s.t. $\{u^{-1}P : u \in Q_P\}$ spans $[Res(P)]$. It is a finite prefixial subset of $\Sigma^*$. Let $A = \langle \Sigma, Q_P, \varphi, \iota, \tau \rangle$ be the MA defined by:
Figure 1. Inclusion relations between classes of rational stochastic languages.



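- $\iota(\varepsilon) = 1$, $\iota(u) = 0$ otherwise; $\tau(u) = u^{-1}P(\varepsilon)$,

- $\varphi(u, x, ux) = u^{-1}P(x\Sigma^*)$ if $u, ux \in Q_P$ and $x \in \Sigma$,

- $\varphi(u, x, v) = \alpha_v u^{-1}P(x\Sigma^*)$ if $x \in \Sigma$, $ux \in (Q_P\Sigma \setminus Q_P) \cap res(P)$ and $(ux)^{-1}P = \sum_{v \in Q_P} \alpha_v v^{-1}P$.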

It can be shown that A is a reduced representation of P; A is called the prefixial reduced 
representation of P. Note that the parameters of A correspond to natural components of 
the residual of P and can be estimated by using samples of P. 

We give below an example of a rational stochastic language which cannot be generated by a PA. Moreover, for any integer $N$ there exists a rational stochastic language which can be generated by a multiplicity automaton with 3 states and such that the smallest PA which generates it has $N$ states. That is, considering rational stochastic languages makes it possible to deal with stochastic languages which cannot be generated by PA; it also makes it possible to significantly decrease the size of their representation.

Proposition 1. For any $\alpha \in \mathbb{R}$, let $A_\alpha$ be the MA described in Fig. 2. Let $S_\alpha = \{(\lambda_0, \lambda_1, \lambda_2) \in \mathbb{R}^3 : r_{A_\alpha} \in \mathcal{S}(\Sigma)\}$. If $\alpha/(2\pi) = p/q \in \mathbb{Q}$ where $p$ and $q$ are relatively prime, $S_\alpha$ is the convex hull of a polygon with $q$ vertices which are the residual languages of any one of them. If $\alpha/(2\pi) \notin \mathbb{Q}$, $S_\alpha$ is the convex hull of an ellipse, any point of which is a stochastic language which cannot be computed by a PA.

Proof (sketch). 

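Let $r_{q_0}$, $r_{q_1}$ and $r_{q_2}$ be the series associated with the states of $A_\alpha$. We have

$$r_{q_0}(a^n) = \frac{\cos n\alpha - \sin n\alpha}{2^n}, \quad r_{q_1}(a^n) = \frac{\cos n\alpha + \sin n\alpha}{2^n} \quad \text{and} \quad r_{q_2}(a^n) = \frac{1}{2^n}.$$

The sums $\sum_{n \in \mathbb{N}} r_{q_0}(a^n)$, $\sum_{n \in \mathbb{N}} r_{q_1}(a^n)$ and $\sum_{n \in \mathbb{N}} r_{q_2}(a^n)$ converge since $|r_{q_i}(a^n)| = O(2^{-n})$ for $i = 0, 1, 2$. Let us denote $\sigma_i = \sum_{n \in \mathbb{N}} r_{q_i}(a^n)$ for $i = 0, 1, 2$. Check that

$$\sigma_0 = \frac{4 - 2\cos\alpha - 2\sin\alpha}{5 - 4\cos\alpha}, \quad \sigma_1 = \frac{4 - 2\cos\alpha + 2\sin\alpha}{5 - 4\cos\alpha} \quad \text{and} \quad \sigma_2 = 2.$$

Consider the 3-dimensional vector subspace $V$ of $\mathbb{R}\langle\langle\Sigma\rangle\rangle$ generated by $r_{q_0}$, $r_{q_1}$ and $r_{q_2}$ and let $r = \lambda_0 r_{q_0} + \lambda_1 r_{q_1} + \lambda_2 r_{q_2}$ be a generic element of $V$. We have $\sum_{n \in \mathbb{N}} r(a^n) = \lambda_0\sigma_0 + \lambda_1\sigma_1 + \lambda_2\sigma_2$. The equation $\lambda_0\sigma_0 + \lambda_1\sigma_1 + \lambda_2\sigma_2 = 1$ defines a plane $\mathcal{H}$ in $V$.

Consider the constraints $r(a^n) \geq 0$ for any $n \geq 0$. The elements $r$ of $\mathcal{H}$ which satisfy all the constraints $r(a^n) \geq 0$ are exactly the stochastic languages in $\mathcal{H}$.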



If $\alpha/(2\pi) = k/h \in \mathbb{Q}$ where $k$ and $h$ are relatively prime, the set of constraints $\{r(a^n) \geq 0\}$ is finite: it delimits a convex regular polygon $P$ in the plane $\mathcal{H}$. Let $p$ be a vertex of $P$. It can be shown that its residual languages are exactly the $h$ vertices of $P$ and any PA generating $p$ must have at least $h$ states.

If $\alpha/(2\pi) \notin \mathbb{Q}$, the constraints delimit an ellipse $E$. Let $p$ be an element of $E$. It can be shown, by using techniques developed in [7], that its residual languages are dense in $E$ and that no PA can generate $p$. □
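The closed forms for the sums $\sigma_0$, $\sigma_1$, $\sigma_2$ used in the proof sketch can be checked numerically. The sketch below assumes that the state series take the values $r_{q_0}(a^n) = (\cos n\alpha - \sin n\alpha)/2^n$, $r_{q_1}(a^n) = (\cos n\alpha + \sin n\alpha)/2^n$ and $r_{q_2}(a^n) = 1/2^n$; the function names and the truncation at 200 terms are illustrative choices.

```python
import math

def sigma_numeric(alpha, terms=200):
    """Truncated sums of the three state series over the words a^n."""
    s0 = sum((math.cos(n * alpha) - math.sin(n * alpha)) / 2**n for n in range(terms))
    s1 = sum((math.cos(n * alpha) + math.sin(n * alpha)) / 2**n for n in range(terms))
    s2 = sum(1 / 2**n for n in range(terms))
    return s0, s1, s2

def sigma_closed(alpha):
    """Closed forms stated in the proof sketch of Proposition 1."""
    d = 5 - 4 * math.cos(alpha)
    return (
        (4 - 2 * math.cos(alpha) - 2 * math.sin(alpha)) / d,
        (4 - 2 * math.cos(alpha) + 2 * math.sin(alpha)) / d,
        2.0,
    )

# Largest gap between truncated sums and closed forms over a few angles.
gaps = [
    max(abs(x - y) for x, y in zip(sigma_numeric(a), sigma_closed(a)))
    for a in (0.3, math.pi / 6, 1.0, 2.5)
]
```

The agreement follows from summing the geometric series $\sum_n (e^{i\alpha}/2)^n = 1/(1 - e^{i\alpha}/2)$ and taking real and imaginary parts.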

Matrices. We consider the Euclidean norm on $\mathbb{R}^n$: $\|(x_1, \ldots, x_n)\| = (x_1^2 + \ldots + x_n^2)^{1/2}$. For any $R \geq 0$, let us denote by $B(0, R)$ the set $\{x \in \mathbb{R}^n : \|x\| \leq R\}$. The induced norm on the set of $n \times n$ square matrices $M$ over $\mathbb{R}$ is defined by: $\|M\| = \sup\{\|Mx\| : x \in \mathbb{R}^n \text{ with } \|x\| = 1\}$. Some properties of the induced norm: $\|Mx\| \leq \|M\| \cdot \|x\|$ for all $M \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$; $\|MN\| \leq \|M\| \cdot \|N\|$ for all $M, N \in \mathbb{R}^{n \times n}$; $\lim_{k \to \infty} \|M^k\|^{1/k} = \rho(M)$ where $\rho(M)$ is the spectral radius of $M$, i.e. the maximum magnitude of the eigenvalues of $M$ (Gelfand's Formula).





Figure 2. When $\lambda_0 = \lambda_2 = 1$ and $\lambda_1 = 0$, the MA $A_{\pi/6}$ defines a stochastic language $P$ whose prefixial reduced representation is the MA $B$ (with approximate values on transitions). In fact, $P$ can be computed by a PDA and the smallest PA computing it is $C$.



3 Identifying $\mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$ in the limit

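Let $S$ be a non empty finite sample of $\Sigma^*$, let $Q$ be a prefixial subset of $pref(S)$, let $v \in pref(S) \setminus Q$, and let $\epsilon > 0$. We denote by $I(Q, v, S, \epsilon)$ the following set of inequations over the set of variables $\{x_u \mid u \in Q\}$:

$$I(Q, v, S, \epsilon) = \Big\{ \Big| v^{-1}P_S(w\Sigma^*) - \sum_{u \in Q} x_u u^{-1}P_S(w\Sigma^*) \Big| \leq \epsilon \;\Big|\; w \in fact(S) \Big\} \cup \Big\{ \sum_{u \in Q} x_u = 1 \Big\}.$$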

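Let DEES be the following algorithm:
Input: a sample $S$
Output: a prefixial reduced MA $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$
$Q \leftarrow \{\varepsilon\}$, $\iota(\varepsilon) = 1$, $\tau(\varepsilon) = P_S(\varepsilon)$, $F \leftarrow \Sigma \cap pref(S)$
while $F \neq \emptyset$ do {
    $v = ux = \mathrm{Min}\, F$ where $u \in \Sigma^*$ and $x \in \Sigma$, $F \leftarrow F \setminus \{v\}$
    if $I(Q, v, S, |S|^{-1/3})$ has no solution then {
        $Q \leftarrow Q \cup \{v\}$, $\iota(v) = 0$, $\tau(v) = P_S(v)/P_S(v\Sigma^*)$,
        $\varphi(u, x, v) = P_S(v\Sigma^*)/P_S(u\Sigma^*)$, $F \leftarrow F \cup \{vx \in res(P_S) \mid x \in \Sigma\}$ }
    else {
        let $(\alpha_w)_{w \in Q}$ be a solution of $I(Q, v, S, |S|^{-1/3})$
        $\varphi(u, x, w) = \alpha_w P_S(v\Sigma^*)/P_S(u\Sigma^*)$ for any $w \in Q$ } }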
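The test at the heart of DEES asks whether a set of inequations built from empirical residuals admits a solution. The sketch below does not solve that linear feasibility problem (a linear programming routine would be needed for that); it only checks whether a given candidate vector satisfies the inequations, under the assumption that words are Python strings. All function names and the toy sample are hypothetical.

```python
def prefix_prob(sample, u):
    """Empirical P_S(u Sigma^*)."""
    return sum(1 for w in sample if w.startswith(u)) / len(sample)

def residual_prefix_prob(sample, u, w):
    """u^{-1}P_S(w Sigma^*) = P_S(uw Sigma^*) / P_S(u Sigma^*), for u in pref(S)."""
    return prefix_prob(sample, u + w) / prefix_prob(sample, u)

def factors(sample):
    """fact(S): every factor (contiguous substring) of a word of the sample."""
    return {w[i:j] for w in sample for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

def satisfies_I(x, v, sample, eps, tol=1e-12):
    """Check a candidate x = {u: x_u} (u ranging over Q) against the inequations."""
    if abs(sum(x.values()) - 1.0) > tol:
        return False  # the coefficients must sum to 1
    for w in factors(sample):
        lhs = residual_prefix_prob(sample, v, w)
        rhs = sum(c * residual_prefix_prob(sample, u, w) for u, c in x.items())
        if abs(lhs - rhs) > eps:
            return False
    return True

# Hypothetical sample over {a}: does the residual of "a" look like the whole sample?
sample = ["", "", "", "", "a", "a", "aa", "aaa"]
ok_loose = satisfies_I({"": 1.0}, "a", sample, len(sample) ** (-1 / 3))
ok_tight = satisfies_I({"": 1.0}, "a", sample, 0.1)
```

With the tolerance $|S|^{-1/3} = 0.5$ the candidate is accepted, so on this sample the word a would not become a new state; with the tighter tolerance 0.1 the same candidate is rejected because of the factor aaa.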



Lemma 1. Let $P$ be a stochastic language and let $u_0, u_1, \ldots, u_n \in res(P)$ be such that $\{u_0^{-1}P, u_1^{-1}P, \ldots, u_n^{-1}P\}$ is linearly independent. Then, with probability one, for any complete presentation $S$ of $P$, there exist a positive number $\epsilon$ and an integer $M$ such that $I(\{u_1, \ldots, u_n\}, u_0, S_m, \epsilon)$ has no solution for every $m \geq M$.

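Proof. Let $S$ be a complete presentation of $P$. Suppose that for every $\epsilon > 0$ and every integer $M$, there exists $m \geq M$ such that $I(\{u_1, \ldots, u_n\}, u_0, S_m, \epsilon)$ has a solution. Then, for any integer $k$, there exists $m_k \geq k$ such that $I(\{u_1, \ldots, u_n\}, u_0, S_{m_k}, 1/k)$ has a solution $(\alpha_{1,k}, \ldots, \alpha_{n,k})$. Let $\rho_k = \mathrm{Max}\{1, |\alpha_{1,k}|, \ldots, |\alpha_{n,k}|\}$, $\gamma_{0,k} = 1/\rho_k$ and $\gamma_{i,k} = -\alpha_{i,k}/\rho_k$ for $1 \leq i \leq n$. For every $k$, $\mathrm{Max}\{|\gamma_{i,k}| : 0 \leq i \leq n\} = 1$. Check that

$$\forall k \geq 0, \quad \Big| \sum_{i=0}^{n} \gamma_{i,k} u_i^{-1}P_{S_{m_k}}(w\Sigma^*) \Big| \leq \frac{1}{\rho_k k} \leq \frac{1}{k}.$$

There exists a subsequence $(\alpha_{1,\phi(k)}, \ldots, \alpha_{n,\phi(k)})$ of $(\alpha_{1,k}, \ldots, \alpha_{n,k})$ such that $(\gamma_{0,\phi(k)}, \ldots, \gamma_{n,\phi(k)})$ converges to $(\gamma_0, \ldots, \gamma_n)$. We show below that we should have $\sum_{i=0}^{n} \gamma_i u_i^{-1}P(w\Sigma^*) = 0$ for every word $w$, which is contradictory with the independence assumption since $\mathrm{Max}\{|\gamma_i| : 0 \leq i \leq n\} = 1$.

Let $w \in fact(supp(P))$. With probability 1, there exists an integer $k_0$ such that $w \in fact(S_{m_k})$ for any $k \geq k_0$. For such a $k$, we can write

$$\gamma_i u_i^{-1}P = \gamma_i \big(u_i^{-1}P - u_i^{-1}P_{S_{m_{\phi(k)}}}\big) + (\gamma_i - \gamma_{i,\phi(k)})\, u_i^{-1}P_{S_{m_{\phi(k)}}} + \gamma_{i,\phi(k)}\, u_i^{-1}P_{S_{m_{\phi(k)}}}$$

and therefore

$$\Big| \sum_{i=0}^{n} \gamma_i u_i^{-1}P(w\Sigma^*) \Big| \leq \sum_{i=0}^{n} \big| u_i^{-1}(P - P_{S_{m_{\phi(k)}}})(w\Sigma^*) \big| + \sum_{i=0}^{n} |\gamma_i - \gamma_{i,\phi(k)}| + \frac{1}{k},$$

which converges to 0 when $k$ tends to infinity. □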



Let $P$ be a stochastic language over $\Sigma$, let $\mathcal{A} = (A_i)_{i \in I}$ be a family of subsets of $\Sigma^*$, let $S$ be a finite sample drawn according to $P$, and let $P_S$ be the empirical distribution associated with $S$. It can be shown [13, 9] that for any confidence parameter $\delta$, with a probability greater than $1 - \delta$, for any $i \in I$,

$$|P_S(A_i) - P(A_i)| \leq c \sqrt{\frac{VC(\mathcal{A}) - \log(\delta/4)}{Card(S)}} \qquad (1)$$

where $VC(\mathcal{A})$ is the dimension of Vapnik-Chervonenkis of $\mathcal{A}$ and $c$ is a constant.

When $\mathcal{A} = (\{w\Sigma^*\})_{w \in \Sigma^*}$, $VC(\mathcal{A}) \leq 2$. Indeed, let $r, s, t \in \Sigma^*$ and let $Y = \{r, s, t\}$. Let $u_{rs}$ (resp. $u_{rt}$, $u_{st}$) be the longest prefix shared by $r$ and $s$ (resp. $r$ and $t$, $s$ and $t$). One of these 3 words is a prefix of the two other ones. Suppose that $u_{rs}$ is a prefix of $u_{rt}$ and $u_{st}$. Then, there exists no word $w$ such that $w\Sigma^* \cap Y = \{r, s\}$. Therefore, no subset containing more than two elements can be shattered by $\mathcal{A}$.
Let $\Psi(\epsilon, \delta) = \frac{c^2}{\epsilon^2}\big(2 - \log\frac{\delta}{4}\big)$.

Lemma 2. Let $P \in \mathcal{S}(\Sigma)$ and let $S$ be a complete presentation of $P$. For any precision parameter $\epsilon$, any confidence parameter $\delta$, any $n \geq \Psi(\epsilon, \delta)$, with a probability greater than $1 - \delta$, $|P_{S_n}(w\Sigma^*) - P(w\Sigma^*)| \leq \epsilon$ for all $w \in \Sigma^*$.

Proof. Use inequality (1). □

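Check that for any $\alpha$ such that $-1/2 < \alpha < 0$ and any $\beta < -1$, if we define $\epsilon_k = k^{\alpha}$ and $\delta_k = k^{\beta}$, there exists $K$ such that for all $k \geq K$, we have $k \geq \Psi(\epsilon_k, \delta_k)$. For such choices of $\alpha$ and $\beta$, we have $\lim_{k \to \infty} \epsilon_k = 0$ and $\sum_{k \geq 1} \delta_k < \infty$.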
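The check above can be spelled out as a short computation. It assumes that $\Psi(\epsilon, \delta) = \frac{c^2}{\epsilon^2}(2 - \log\frac{\delta}{4})$, the sample size obtained from inequality (1) when the Vapnik-Chervonenkis dimension is at most 2; with a different constant in $\Psi$ the argument would be the same.

```latex
\begin{align*}
\Psi(\epsilon_k, \delta_k)
  &= \frac{c^2}{k^{2\alpha}} \Big( 2 - \log \frac{k^{\beta}}{4} \Big)
   = c^2\, k^{-2\alpha} \big( 2 + \log 4 - \beta \log k \big)
   = O\big( k^{-2\alpha} \log k \big). \\
\intertext{Since $-1/2 < \alpha < 0$, we have $0 < -2\alpha < 1$, hence}
\frac{\Psi(\epsilon_k, \delta_k)}{k}
  &= O\big( k^{-2\alpha - 1} \log k \big) \xrightarrow[k \to \infty]{} 0,
\end{align*}
```

so $k \geq \Psi(\epsilon_k, \delta_k)$ for every sufficiently large $k$. Moreover $\epsilon_k = k^{\alpha} \to 0$ because $\alpha < 0$, and $\sum_{k \geq 1} k^{\beta}$ converges because $\beta < -1$, which is what the Borel-Cantelli Lemma requires. The choice $\alpha = -1/3$, $\beta = -2$ is the one used in the proof of Lemma 3.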

Lemma 3. Let $P \in \mathcal{S}(\Sigma)$, $u_0, u_1, \ldots, u_n \in res(P)$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ be such that $u_0^{-1}P = \sum_{i=1}^{n} \alpha_i u_i^{-1}P$. Then, with probability one, for any complete presentation $S$ of $P$, there exists $K$ s.t. $I(\{u_1, \ldots, u_n\}, u_0, S_k, k^{-1/3})$ has a solution for every $k \geq K$.

Proof. Let $S$ be a complete presentation of $P$. Let $\alpha_0 = 1$ and let $R = \mathrm{Max}\{|\alpha_i| : 0 \leq i \leq n\}$. With probability one, there exists $K_1$ s.t. $\forall k \geq K_1$, $\forall i = 0, \ldots, n$, $|u_i^{-1}S_k| \geq \Psi([k^{1/3}(n+1)R]^{-1}, [(n+1)k^2]^{-1})$. Let $k \geq K_1$. For any $X \subseteq \Sigma^*$,

$$\Big| u_0^{-1}P_{S_k}(X) - \sum_{i=1}^{n} \alpha_i u_i^{-1}P_{S_k}(X) \Big| \leq \big| u_0^{-1}P_{S_k}(X) - u_0^{-1}P(X) \big| + \sum_{i=1}^{n} |\alpha_i| \big| u_i^{-1}P_{S_k}(X) - u_i^{-1}P(X) \big|.$$

From Lemma 2, with probability greater than $1 - 1/k^2$, for any $i = 0, \ldots, n$ and any word $w$, $|u_i^{-1}P_{S_k}(w\Sigma^*) - u_i^{-1}P(w\Sigma^*)| \leq [k^{1/3}(n+1)R]^{-1}$ and therefore, $|u_0^{-1}P_{S_k}(w\Sigma^*) - \sum_{i=1}^{n} \alpha_i u_i^{-1}P_{S_k}(w\Sigma^*)| \leq k^{-1/3}$.

For any integer $k \geq K_1$, let $A_k$ be the event: $|u_0^{-1}P_{S_k}(w\Sigma^*) - \sum_{i=1}^{n} \alpha_i u_i^{-1}P_{S_k}(w\Sigma^*)| > k^{-1/3}$. Since $Pr(A_k) < 1/k^2$, the probability that a finite number of $A_k$ occurs is 1.

Therefore, with probability 1, there exists an integer $K$ such that for any $k \geq K$, $I(\{u_1, \ldots, u_n\}, u_0, S_k, k^{-1/3})$ has a solution. □

Lemma 4. Let $P \in \mathcal{S}(\Sigma)$, let $u_0, u_1, \ldots, u_n \in res(P)$ be such that $\{u_1^{-1}P, \ldots, u_n^{-1}P\}$ is linearly independent and let $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ be such that $u_0^{-1}P = \sum_{i=1}^{n} \alpha_i u_i^{-1}P$. Then, with probability one, for any complete presentation $S$ of $P$, there exists an integer $K$ such that $\forall k \geq K$, any solution $\hat{\alpha}_1, \ldots, \hat{\alpha}_n$ of $I(\{u_1, \ldots, u_n\}, u_0, S_k, k^{-1/3})$ satisfies $|\hat{\alpha}_i - \alpha_i| \leq O(k^{-1/3})$ for $1 \leq i \leq n$.

Proof. Let $w_1, \ldots, w_n \in \Sigma^*$ be such that the square matrix $M$ defined by $M[i,j] = u_j^{-1}P(w_i\Sigma^*)$ for $1 \leq i, j \leq n$ is invertible. Let $A = (\alpha_1, \ldots, \alpha_n)^t$, $U_0 = (u_0^{-1}P(w_1\Sigma^*), \ldots, u_0^{-1}P(w_n\Sigma^*))^t$. We have $MA = U_0$. Let $S$ be a complete presentation of $P$, let $k \in \mathbb{N}$ and let $\hat{\alpha}_1, \ldots, \hat{\alpha}_n$ be a solution of $I(\{u_1, \ldots, u_n\}, u_0, S_k, k^{-1/3})$. Let $M_k$ be the square matrix defined by $M_k[i,j] = u_j^{-1}P_{S_k}(w_i\Sigma^*)$ for $1 \leq i, j \leq n$, let $A_k = (\hat{\alpha}_1, \ldots, \hat{\alpha}_n)^t$ and $U_{0,k} = (u_0^{-1}P_{S_k}(w_1\Sigma^*), \ldots, u_0^{-1}P_{S_k}(w_n\Sigma^*))^t$. We have

$$\|M_k A_k - U_{0,k}\|^2 = \sum_{i=1}^{n} \Big[ u_0^{-1}P_{S_k}(w_i\Sigma^*) - \sum_{j=1}^{n} \hat{\alpha}_j u_j^{-1}P_{S_k}(w_i\Sigma^*) \Big]^2 \leq n k^{-2/3}.$$

Check that

$$A - A_k = M^{-1}(MA - U_0 + U_0 - U_{0,k} + U_{0,k} - M_k A_k + M_k A_k - M A_k)$$

and therefore, for any $1 \leq i \leq n$,

$$|\alpha_i - \hat{\alpha}_i| \leq \|A - A_k\| \leq \|M^{-1}\| \big( \|U_0 - U_{0,k}\| + n^{1/2} k^{-1/3} + \|M_k - M\| \|A_k\| \big).$$

Now, by using Lemma 2 and Borel-Cantelli Lemma as in the proof of Lemma 3, with probability 1, there exists $K$ such that for all $k \geq K$, $\|U_0 - U_{0,k}\| \leq O(k^{-1/3})$ and $\|M_k - M\| \leq O(k^{-1/3})$. Therefore, for all $k \geq K$, any solution $\hat{\alpha}_1, \ldots, \hat{\alpha}_n$ of $I(\{u_1, \ldots, u_n\}, u_0, S_k, k^{-1/3})$ satisfies $|\hat{\alpha}_i - \alpha_i| \leq O(k^{-1/3})$ for $1 \leq i \leq n$. □

Theorem 1. Let $P \in \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$ and $A$ be the prefixial reduced representation of $P$. Then, with probability one, for any complete presentation $S$ of $P$, there exists an integer $K$ such that for any $k \geq K$, DEES($S_k$) returns a multiplicity automaton $A_k$ whose support is the same as $A$'s. Moreover, there exists a constant $C$ such that for any parameter $\alpha$ of $A$, the corresponding parameter $\alpha_k$ in $A_k$ satisfies $|\alpha - \alpha_k| \leq Ck^{-1/3}$.

Proof. Let $Q_P$ be the set of states of $A$, i.e. the smallest prefixial subset of $res(P)$ such that $\{u^{-1}P : u \in Q_P\}$ spans the same vector space as $Res(P)$. Let $u \in Q_P$, let $Q_u = \{v \in Q_P \mid v < u\}$ and let $x \in \Sigma$.

- If $\{v^{-1}P \mid v \in Q_u \cup \{ux\}\}$ is linearly independent, from Lemma 1, with probability 1, there exist $\epsilon_{ux}$ and $K_{ux}$ such that for any $k \geq K_{ux}$, $I(Q_u, ux, S_k, \epsilon_{ux})$ has no solution.

- If there exists $(\alpha_v)_{v \in Q_u}$ such that $(ux)^{-1}P = \sum_{v \in Q_u} \alpha_v v^{-1}P$, from Lemma 3, with probability 1, there exists an integer $K_{ux}$ such that for any $k \geq K_{ux}$, $I(Q_u, ux, S_k, k^{-1/3})$ has a solution.

Therefore, with probability one, there exists an integer $K$ such that for any $k \geq K$, DEES($S_k$) returns a multiplicity automaton $A_k$ whose set of states is equal to $Q_P$. Use Lemmas 2 and 4 to check the last part of the proposition. □

When the target is in $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$, DEES can be used to exactly identify it. The proof is based on the representation of real numbers by continued fractions. See [8] for a survey on continued fractions and for a similar application.

Let $(\epsilon_n)$ be a sequence of non negative real numbers which converges to 0, let $x \in \mathbb{Q}$, let $(y_n)$ be a sequence of elements of $\mathbb{Q}$ such that $|x - y_n| \leq \epsilon_n$ for all but finitely many $n$. It can be shown that there exists an integer $N$ such that, for any $n \geq N$, $x$ is the unique rational number $\frac{p}{q}$ which satisfies $\big| y_n - \frac{p}{q} \big| \leq \epsilon_n \leq \frac{1}{q^2}$. Moreover, the unique solution of these inequations can be computed from $y_n$.
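One way to search for a rational $p/q$ satisfying $|y - p/q| \leq \epsilon \leq 1/q^2$ is sketched below. It is an illustration, not the procedure of the paper: instead of expanding $y$ into a continued fraction explicitly, it relies on the standard library method `Fraction.limit_denominator`, which returns the fraction closest to $y$ among those whose denominator does not exceed a bound. The function name `recover_rational` is hypothetical.

```python
import math
from fractions import Fraction

def recover_rational(y, eps):
    """Return some p/q with |y - p/q| <= eps <= 1/q**2, or None if there is none."""
    y, eps = Fraction(y), Fraction(eps)
    if eps <= 0:
        raise ValueError("eps must be positive")
    # Any solution has q*q <= 1/eps, so q is at most isqrt(floor(1/eps)).
    max_den = math.isqrt(eps.denominator // eps.numerator)
    if max_den < 1:
        return None
    # Closest fraction to y with denominator <= max_den; if any solution
    # exists, this closest fraction is within eps of y as well.
    candidate = y.limit_denominator(max_den)
    return candidate if abs(y - candidate) <= eps else None

# Hypothetical example: the target 3/7 is observed with an error of 1/1000.
estimate = Fraction(3, 7) + Fraction(1, 1000)
recovered = recover_rational(estimate, Fraction(1, 100))
```

When the solution is unique, as the text states it is for $n$ large enough, the returned fraction is the target itself; otherwise the function returns one admissible fraction among several.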

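Let $P \in \mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$, let $S$ be a complete presentation of $P$ and let $A_k$ be the MA output by DEES on input $S_k$. Let $\overline{A}_k$ be the MA derived from $A_k$ by replacing every parameter $\alpha_k$ with a solution $\frac{p}{q}$ of $\big| \alpha_k - \frac{p}{q} \big| \leq k^{-1/4} \leq \frac{1}{q^2}$.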

Theorem 2. Let $P \in \mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ and $A$ be the prefixial reduced representation of $P$. Then, with probability one, for any complete presentation $S$ of $P$, there exists an integer $K$ such that $\forall k \geq K$, DEES($S_k$) returns an MA $A_k$ such that $\overline{A}_k = A$.

Proof. From the previous theorem, for every parameter $\alpha$ of $A$, the corresponding parameter $\alpha_k$ in $A_k$ satisfies $|\alpha - \alpha_k| \leq Ck^{-1/3}$ for some constant $C$. Therefore, if $k$ is sufficiently large, we have $|\alpha - \alpha_k| \leq k^{-1/4}$ and there exists an integer $K$ such that $\alpha = p/q$ is the unique solution of $\big| \alpha_k - \frac{p}{q} \big| \leq k^{-1/4} \leq \frac{1}{q^2}$. □



4 Learning rational stochastic languages 



We have seen that $\mathcal{S}_{\mathbb{Q}}^{rat}(\Sigma)$ is identifiable in the limit. Moreover, DEES runs in polynomial time and aims at computing a representation of the target which is minimal and whose parameters depend only on the target to be learned. DEES computes estimates which are proved to converge reasonably fast to these parameters. That is, DEES computes functions which are likely to be close to the target. But these functions are not stochastic languages and it remains to study how they can be used in a grammatical inference perspective.

Any rational stochastic language $P$ defines a vector subspace of $\mathbb{R}\langle\langle\Sigma\rangle\rangle$ in which the stochastic languages form a compact convex subset.

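Proposition 2. Let $p_1, \ldots, p_n$ be $n$ independent stochastic languages. Then, $\Lambda = \{\vec{\alpha} = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n : \sum_{i=1}^{n} \alpha_i p_i \in \mathcal{S}(\Sigma)\}$ is a compact convex subset of $\mathbb{R}^n$.

Proof. First, check that for any $\vec{\alpha}, \vec{\beta} \in \Lambda$ and any $\gamma \in [0, 1]$, the series $\sum_{i=1}^{n} [\gamma\alpha_i + (1 - \gamma)\beta_i] p_i$ is a stochastic language. Hence, $\Lambda$ is convex.

For every word $w$, the mapping $\vec{\alpha} \to \sum_{i=1}^{n} \alpha_i p_i(w)$ defined from $\mathbb{R}^n$ into $\mathbb{R}$ is linear; and so is the mapping $\vec{\alpha} \to \sum_{i=1}^{n} \alpha_i$. $\Lambda$ is closed since these mappings are continuous and since

$$\Lambda = \Big\{ \vec{\alpha} \in \mathbb{R}^n : \sum_{i=1}^{n} \alpha_i p_i(w) \geq 0 \text{ for every word } w \text{ and } \sum_{i=1}^{n} \alpha_i = 1 \Big\}.$$

Now, let us show that $\Lambda$ is bounded. Suppose that for any integer $k$, there exists $\vec{\alpha}_k \in \Lambda$ such that $\|\vec{\alpha}_k\| \geq k$. Since $\vec{\alpha}_k / \|\vec{\alpha}_k\|$ belongs to the unit sphere in $\mathbb{R}^n$, which is compact, there exists a subsequence $\vec{\alpha}_{\phi(k)}$ such that $\vec{\alpha}_{\phi(k)} / \|\vec{\alpha}_{\phi(k)}\|$ converges to some $\vec{\alpha}$ satisfying $\|\vec{\alpha}\| = 1$. Let $q_k = \sum_{i=1}^{n} \alpha_{i,k} p_i$ and $r = \sum_{i=1}^{n} \alpha_i p_i$.

For any $0 < \lambda \leq \|\vec{\alpha}_k\|$, $p_1 + \lambda \frac{q_k - p_1}{\|\vec{\alpha}_k\|} = \big(1 - \frac{\lambda}{\|\vec{\alpha}_k\|}\big) p_1 + \frac{\lambda}{\|\vec{\alpha}_k\|} q_k$ is a stochastic language since $\mathcal{S}(\Sigma)$ is convex; for every $\lambda > 0$, $p_1 + \lambda \frac{q_{\phi(k)} - p_1}{\|\vec{\alpha}_{\phi(k)}\|}$ converges to $p_1 + \lambda r$ when $k \to \infty$, which is a stochastic language since $\Lambda$ is closed. Therefore, for any $\lambda > 0$, $p_1 + \lambda r$ is a stochastic language. Since $p_1(w) + \lambda r(w) \in [0, 1]$ for every word $w$, we must have $r = 0$, i.e. $\alpha_i = 0$ for any $1 \leq i \leq n$ since the languages $p_1, \ldots, p_n$ are independent, which is impossible since $\|\vec{\alpha}\| = 1$. Therefore, $\Lambda$ is bounded. □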

The MA $A$ output by DEES generally do not compute stochastic languages. However, we wish that the series $r_A$ they compute share some properties with them. The next proposition gives sufficient conditions which guarantee that $\sum_{k \geq 0} r_A(\Sigma^k) = 1$.

Proposition 3. Let $A = \langle \Sigma, Q = \{q_1, \ldots, q_n\}, \varphi, \iota, \tau \rangle$ be an MA and let $M$ be the square matrix defined by $M[i,j] = [\varphi(q_i, \Sigma, q_j)]_{1 \leq i,j \leq n}$. Suppose that the spectral radius of $M$ satisfies $\rho(M) < 1$. Let $\vec{\iota} = (\iota(q_1), \ldots, \iota(q_n))$ and $\vec{\tau} = (\tau(q_1), \ldots, \tau(q_n))^t$.

1. Then, the matrix $(I - M)$ is invertible and $\sum_{k \geq 0} M^k$ converges to $(I - M)^{-1}$.

2. $\forall q_i \in Q$, $\forall K \geq 0$, $\sum_{k \geq K} r_{A,q_i}(\Sigma^k)$ converges to $\sum_{j=1}^{n} [M^K (I - M)^{-1}][i,j]\, \tau(q_j)$ and $\sum_{k \geq K} r_A(\Sigma^k)$ converges to $\vec{\iota} M^K (I - M)^{-1} \vec{\tau}$.

3. If $\forall q \in Q$, $\tau(q) + \varphi(q, \Sigma, Q) = 1$, then $\forall q \in Q$, $\sum_{k \geq 0} r_{A,q}(\Sigma^k) = 1$. If moreover $\sum_{q \in Q} \iota(q) = 1$, then $\sum_{k \geq 0} r_A(\Sigma^k) = 1$.

Proof. 1. Since $\rho(M) < 1$, 1 is not an eigenvalue of $M$ and $I - M$ is invertible. From Gelfand's formula, $\lim_{k \to \infty} \|M^k\| = 0$. Since for any integer $k$, $(I - M)(I + M + \ldots + M^k) = I - M^{k+1}$, the sum $\sum_{k \geq 0} M^k$ converges to $(I - M)^{-1}$.

2. Since $r_{A,q_i}(\Sigma^k) = \sum_{j=1}^{n} M^k[i,j]\, \tau(q_j)$, $\sum_{k \geq K} r_{A,q_i}(\Sigma^k) = \sum_{j=1}^{n} [M^K (I - M)^{-1}][i,j]\, \tau(q_j)$ and $\sum_{k \geq K} r_A(\Sigma^k) = \sum_{i=1}^{n} \iota(q_i)\, r_{A,q_i}(\Sigma^{\geq K}) = \vec{\iota} M^K (I - M)^{-1} \vec{\tau}$.

3. Let $s_i = r_{A,q_i}(\Sigma^*)$ for $1 \leq i \leq n$ and $\vec{s} = (s_1, \ldots, s_n)^t$. We have $(I - M)\vec{s} = \vec{\tau}$. Since $I - M$ is invertible, there exists one and only one $\vec{s}$ such that $(I - M)\vec{s} = \vec{\tau}$. But since $\tau(q) + \varphi(q, \Sigma, Q) = 1$ for any state $q$, the vector $(1, \ldots, 1)^t$ is clearly a solution. Therefore, $s_i = 1$ for $1 \leq i \leq n$. If $\sum_{q \in Q} \iota(q) = 1$, then $r_A(\Sigma^*) = \sum_{q \in Q} \iota(q)\, r_{A,q}(\Sigma^*) = 1$. □
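The third item of Proposition 3 can be observed numerically on a small example. The sketch below is an illustration under stated assumptions: the 2 x 2 matrix `M` is a hypothetical example with a negative entry and spectral radius about 0.51, the termination vector is chosen so that every row satisfies $\tau(q) + \varphi(q, \Sigma, Q) = 1$, and the infinite sum $\sum_k M^k \vec{\tau}$ is truncated at 200 terms.

```python
def mat_vec(M, v):
    """Product of a square matrix (list of rows) with a column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def state_masses(M, tau, terms=200):
    """Truncated sums s_i = sum over k < terms of (M^k tau)[i]."""
    total = [0.0] * len(tau)
    term = list(tau)
    for _ in range(terms):
        total = [t + x for t, x in zip(total, term)]
        term = mat_vec(M, term)  # next term M^(k+1) tau
    return total

# Hypothetical transition matrix M[i][j] = phi(q_i, Sigma, q_j), one negative weight.
M = [[0.5, -0.2], [0.3, 0.4]]
# Termination weights chosen so that tau(q) + phi(q, Sigma, Q) = 1 for each state.
tau = [1 - sum(row) for row in M]
masses = state_masses(M, tau)
```

Both components of `masses` come out numerically equal to 1, in line with the fact that the all-ones vector solves $(I - M)\vec{s} = \vec{\tau}$ when the row condition holds.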

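Proposition 4. Let $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ be a reduced representation of a stochastic language $P$. Let $Q = \{q_1, \ldots, q_n\}$ and let $M$ be the square matrix defined by $M[i,j] = [\varphi(q_i, \Sigma, q_j)]_{1 \leq i,j \leq n}$. Then the spectral radius of $M$ satisfies $\rho(M) < 1$.

Proof. From Prop. 2, let $R$ be such that $\{\vec{\alpha} \in \mathbb{R}^n : \sum_{i=1}^{n} \alpha_i P_{A,q_i} \in \mathcal{S}(\Sigma)\} \subseteq B(0, R)$. For every $u \in res(P_A)$ and every $1 \leq i \leq n$, we have

$$u^{-1}P_{A,q_i} = \sum_{1 \leq j \leq n} \frac{\varphi(q_i, u, q_j)}{P_{A,q_i}(u\Sigma^*)} P_{A,q_j}.$$

Therefore, for every word $u$ and every $k$, we have $|\varphi(q_i, u, q_j)| \leq R \cdot P_{A,q_i}(u\Sigma^*)$ and

$$|\varphi(q_i, \Sigma^k, q_j)| \leq \sum_{u \in \Sigma^k} |\varphi(q_i, u, q_j)| \leq R \cdot P_{A,q_i}(\Sigma^{\geq k}).$$

Now, let $\lambda$ be an eigenvalue of $M$ associated with the eigenvector $v$ and let $i$ be an index such that $|v_i| = \mathrm{Max}\{|v_j| : j = 1, \ldots, n\}$. For every integer $k$, we have

$$M^k v = \lambda^k v \quad \text{and} \quad |\lambda^k v_i| = \Big| \sum_{j=1}^{n} \varphi(q_i, \Sigma^k, q_j) v_j \Big| \leq nR \cdot P_{A,q_i}(\Sigma^{\geq k})\, |v_i|$$

which implies that $|\lambda| < 1$ since $P_{A,q_i}(\Sigma^{\geq k})$ converges to 0 when $k \to \infty$. □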

If the spectral radius of a matrix $M$ is $< 1$, the powers of $M$ decrease exponentially fast.

Lemma 5. Let $M \in \mathbb{R}^{n \times n}$ be such that $\rho(M) < 1$. Then, there exist $C \in \mathbb{R}$ and $\rho \in [0, 1[$ such that for any integer $k \geq 0$, $\|M^k\| \leq C\rho^k$.

Proof. Let $\rho \in ]\rho(M), 1[$. From Gelfand's formula, there exists an integer $K$ such that for any $k \geq K$, $\|M^k\|^{1/k} \leq \rho$. Let $C = \mathrm{Max}\{\|M^h\|/\rho^h : h < K\}$. Let $k \in \mathbb{N}$ and let $a, b \in \mathbb{N}$ be such that $k = aK + b$ and $b < K$. We have

$$\|M^k\| = \|M^{aK+b}\| \leq \|M^{aK}\| \|M^b\| \leq \rho^{aK} \|M^b\| \leq \rho^k \frac{\|M^b\|}{\rho^b} \leq C\rho^k.$$

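Proposition 5. Let $P \in \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$. There exists a constant $C$ and $\rho \in [0, 1[$ such that for any integer $k$, $P(\Sigma^{\geq k}) \leq C\rho^k$.

Proof. Let $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ be a reduced representation of $P$ and let $M$ be the square matrix defined by $M[i,j] = [\varphi(q_i, \Sigma, q_j)]_{1 \leq i,j \leq n}$. From Prop. 4, the spectral radius of $M$ is $< 1$. From Lemma 5, there exist $C_1$ and $\rho \in [0, 1[$ such that $\|M^k\| \leq C_1\rho^k$ for every integer $k$. Let $\vec{\iota} = (\iota(q_1), \ldots, \iota(q_n))$ and $\vec{\tau} = (\tau(q_1), \ldots, \tau(q_n))^t$. We have

$$P(\Sigma^{\geq k}) \leq \|\vec{\iota}\| \cdot \|M^k\| \cdot \|(I - M)^{-1}\| \cdot \|\vec{\tau}\| \leq C\rho^k$$

with $C = C_1 \|\vec{\iota}\| \cdot \|(I - M)^{-1}\| \cdot \|\vec{\tau}\|$. □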

It is not difficult to design an MA $A$ which generates a stochastic language $P$ and such that $\varphi(q, u, q')$ is unbounded when $u \in \Sigma^*$. However, the next proposition proves that this situation never happens when $A$ is a reduced representation of $P$.

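Proposition 6. Let $P \in \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$ and let $A = \langle \Sigma, Q, \varphi, \iota, \tau \rangle$ be a reduced representation of $P$. Then, there exists a constant $C$ and $\rho \in [0, 1[$ such that for any integer $k$ and any pair of states $q, q'$, $\sum_{u \in \Sigma^k} |\varphi(q, u, q')| \leq C\rho^k$.

Proof. Let $k$ be an integer and let $q, q' \in Q$. Let $P_k = \{u \in \Sigma^k : \varphi(q, u, q') \geq 0\}$ and $N_k = \Sigma^k \setminus P_k$.

$$\sum_{u \in P_k} \frac{P_{A,q}(u\Sigma^*)}{\sum_{u \in P_k} P_{A,q}(u\Sigma^*)}\, u^{-1}P_{A,q} = \sum_{q'' \in Q} \frac{\sum_{u \in P_k} \varphi(q, u, q'')}{\sum_{u \in P_k} P_{A,q}(u\Sigma^*)}\, P_{A,q''}$$

is a stochastic language which is a linear combination of the independent stochastic languages $P_{A,q''}$. From Prop. 2, there exists a constant $R$ which depends only on $A$ s.t.

$$\sum_{u \in P_k} |\varphi(q, u, q')| = \sum_{u \in P_k} \varphi(q, u, q') \leq R \sum_{u \in P_k} P_{A,q}(u\Sigma^*).$$

Similarly, we have $\big| \sum_{u \in N_k} \varphi(q, u, q') \big| = \sum_{u \in N_k} |\varphi(q, u, q')| \leq R \sum_{u \in N_k} P_{A,q}(u\Sigma^*)$.

Let $C$ and $\rho \in ]0, 1[$ be such that $P_{A,q}(\Sigma^{\geq k}) \leq C\rho^k$ for any state $q$ and any integer $k$. We have

$$\sum_{u \in \Sigma^k} |\varphi(q, u, q')| \leq R\, P_{A,q}(\Sigma^{\geq k}) \leq RC\rho^k.$$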



MA representations of rational stochastic languages are unstable (see Fig. 3). Arbitrarily close to an MA $A$ which generates a stochastic language, we can find an MA $B$ such that the sum $\sum_{w \in \Sigma^*} r_B(w)$ converges to any real number or even diverges. However, the next theorem shows that when $A$ is a reduced representation of a stochastic language, any MA $B$ sufficiently close to $A$ defines a series which is absolutely convergent. Moreover, simple syntactical conditions ensure that $r_B(\Sigma^*) = 1$.

Theorem 3. Let $P \in \mathcal{S}_{\mathbb{R}}^{rat}(\Sigma)$ and let $A = \langle \Sigma, Q, \varphi_A, \iota_A, \tau_A \rangle$ be a reduced representation of $P$. Let $C_A$ and $\rho_A \in ]0, 1[$ be such that for any integer $k$ and any pair of states $q, q'$, $\sum_{u \in \Sigma^k} |\varphi_A(q, u, q')| \leq C_A\rho_A^k$. Then, for any $\rho > \rho_A$, there exist $C$ and $\alpha > 0$ such that for any MA $B = \langle \Sigma, Q, \varphi_B, \iota_B, \tau_B \rangle$ satisfying

$$\forall q, q' \in Q, \forall x \in \Sigma, \quad |\varphi_A(q, x, q') - \varphi_B(q, x, q')| < \alpha \qquad (2)$$

we have $\sum_{u \in \Sigma^k} |\varphi_B(q, u, q')| \leq C\rho^k$ for any pair of states $q, q'$ and any integer $k$. As a consequence, the series $r_B$ is absolutely convergent. Moreover, if $B$ satisfies also

$$\forall q \in Q, \quad \tau_B(q) + \varphi_B(q, \Sigma, Q) = 1 \quad \text{and} \quad \sum_{q \in Q} \iota_B(q) = 1 \qquad (3)$$

then, $\alpha$ can be chosen such that (2) implies that $r_{B,q}(\Sigma^*) = 1$ for any state $q$ and $r_B(\Sigma^*) = 1$.

Figure 3. These MA compute a series $r_\epsilon$ such that $\sum_{w \in \Sigma^*} r_\epsilon(w) = 1$ if $\epsilon \neq 0$ and $\sum_{w \in \Sigma^*} r_0(w) = 2/5$. Note that when $\epsilon = 0$, the series $r_{0,q_1}$ and $r_{0,q_2}$ are dependent.

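Proof. Let $k$ be such that $(2 n C_A)^{1/k} \le \rho / \rho_A$ where $n = |Q|$. There exists $\alpha > 0$ such that for any MA $B = \langle \Sigma, Q, \varphi_B, \iota_B, \tau_B \rangle$ satisfying (2), we have

$$\forall q, q' \in Q, \ \sum_{u \in \Sigma^k} |\varphi_B(q, u, q') - \varphi_A(q, u, q')| < C_A \rho_A^k.$$

Since $\sum_{u \in \Sigma^k} |\varphi_A(q, u, q')| \le C_A \rho_A^k$, we must have also

$$\sum_{u \in \Sigma^k} |\varphi_B(q, u, q')| \le 2 C_A \rho_A^k \le \frac{\rho^k}{n}.$$

Let $C_1 = \mathrm{Max}\{\sum_{u \in \Sigma^{<k}} |\varphi_B(q, u, q')| : q, q' \in Q\}$. Let $l, a, b \in \mathbb{N}$ be such that $l = ak + b$ and $b < k$. Let $u \in \Sigma^l$ and let $u = u_0 \ldots u_a$ where $|u_i| = k$ for $0 \le i < a$ and $|u_a| = b$. For any pair of states $q_0, q_{a+1}$, we have

$$\varphi_B(q_0, u, q_{a+1}) = \sum_{q_1, \ldots, q_a \in Q} \prod_{i=0}^{a} \varphi_B(q_i, u_i, q_{i+1})$$

and

$$\sum_{u \in \Sigma^l} |\varphi_B(q_0, u, q_{a+1})| = \sum_{u_0, \ldots, u_{a-1} \in \Sigma^k} \sum_{u_a \in \Sigma^b} \left| \sum_{q_1, \ldots, q_a \in Q} \prod_{i=0}^{a} \varphi_B(q_i, u_i, q_{i+1}) \right|$$

$$\le \sum_{q_1, \ldots, q_a \in Q} \sum_{u_0, \ldots, u_{a-1} \in \Sigma^k} \sum_{u_a \in \Sigma^b} \prod_{i=0}^{a} |\varphi_B(q_i, u_i, q_{i+1})|$$

$$\le \sum_{q_1, \ldots, q_a \in Q} \prod_{i=0}^{a-1} \left( \sum_{u \in \Sigma^k} |\varphi_B(q_i, u, q_{i+1})| \right) \left( \sum_{u \in \Sigma^b} |\varphi_B(q_a, u, q_{a+1})| \right)$$

$$\le \rho^{ak} C_1 \le C \rho^l \quad \text{where } C = \frac{C_1}{\rho^{k-1}}.$$

Now, let us prove that $r_B$ is absolutely convergent.

$$\sum_{w \in \Sigma^*} |r_B(w)| \le \sum_{k \in \mathbb{N}} \sum_{u \in \Sigma^k} \sum_{q, q' \in Q} |\iota_B(q) \varphi_B(q, u, q') \tau_B(q')| \le C''$$

where $C'' = C n^2 \, \mathrm{Max}\{|\iota_B(q) \tau_B(q')| : q, q' \in Q\} / (1 - \rho)$.

Lastly, let $M_B$ be the square matrix defined by $M_B[i, j] = \varphi_B(q_i, \Sigma, q_j)$. Since the spectral radius of a matrix depends continuously on its coefficients and since $A$ is a reduced representation of a stochastic language, any MA satisfying (2) with $\alpha$ sufficiently small must have a spectral radius $< 1$ (Prop. 4). Therefore, if $B$ satisfies (2) and (3) with $\alpha$ sufficiently small, Prop. 3 entails the conclusion. □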
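The last part of Theorem 3 can be checked numerically on a small example. The Python sketch below is an illustration under stated assumptions: the two-state MA is hypothetical (its weights, and the names `PHI_B`, `M`, `TAU`, `IOTA` and `mass_up_to`, are invented here), it satisfies condition (3) by construction, and its matrix $M_B$ has spectral radius $0.6 < 1$. The sketch accumulates $\sum_{|w| \le K} r_B(w) = \sum_{k \le K} \iota_B M_B^k \tau_B$ with exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical 2-state MA B over {a, b}; PHI_B[x][q][q2] stands for phi_B(q, x, q2).
# None of these weights come from the paper.
PHI_B = {
    "a": [[F(3, 10), F(2, 10)], [F(0), F(4, 10)]],
    "b": [[F(1, 10), F(-1, 10)], [F(2, 10), F(1, 10)]],
}
N_STATES = 2

# M[q][q2] = phi_B(q, Sigma, q2): transition weights summed over the letters.
M = [[sum(PHI_B[x][i][j] for x in PHI_B) for j in range(N_STATES)]
     for i in range(N_STATES)]

# Condition (3): tau_B(q) = 1 - phi_B(q, Sigma, Q), and the initial weights sum to 1.
TAU = [1 - sum(M[i]) for i in range(N_STATES)]
IOTA = [F(6, 5), F(-1, 5)]


def mass_up_to(max_len):
    """Sum of r_B(w) over all words w with |w| <= max_len."""
    total = F(0)
    vec = list(TAU)  # vec = M^k * tau, starting with k = 0
    for _ in range(max_len + 1):
        total += sum(IOTA[i] * vec[i] for i in range(N_STATES))
        vec = [sum(M[i][j] * vec[j] for j in range(N_STATES))
               for i in range(N_STATES)]
    return total
```

On this example, `mass_up_to(K)` equals $1 - \iota_B M_B^{K+1} \mathbf{1}$, so it tends to 1 as $K$ grows even though one initial weight and one transition weight are negative.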

It remains to show how a series which converges absolutely to 1 can be used to 
approximate a stochastic language. 

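Let $r$ be a series over $\Sigma$ such that $\sum_{w \in \Sigma^*} r(w)$ converges absolutely to 1. Therefore, $r(X) = \sum_{u \in X} r(u)$ is defined without ambiguity for every $X \subseteq \Sigma^*$ and $r(X)$ is bounded by $\bar{r} = \sum_{u \in \Sigma^*} |r(u)|$. Let $S$ be the smallest subset of $\Sigma^*$ such that

$$\epsilon \in S \ \text{ and } \ \forall u \in \Sigma^*, \forall x \in \Sigma, \ u \in S \text{ and } r(ux\Sigma^*) > 0 \Rightarrow ux \in S.$$

$S$ is a prefixial subset of $\Sigma^*$ and $\forall u \in S, r(u\Sigma^*) > 0$. For every word $u \in S$, let us define $N(u) = \cup \{ux\Sigma^* : x \in \Sigma, r(ux\Sigma^*) \le 0\} \cup \{u : r(u) \le 0\}$ and $N = \cup \{N(u) : u \in \Sigma^*\}$. Then, for every $u \in S$, let us define $\lambda_u$ by:

$$\lambda_\epsilon = (1 - r(N(\epsilon)))^{-1} \ \text{ and } \ \lambda_{ux} = \lambda_u \frac{r(ux\Sigma^*)}{r(ux\Sigma^*) - r(N(ux))}.$$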
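Lemma 6. For every word $u \in S$, $e^{r(N)/\bar{r}} \le \lambda_u \le 1$.

Proof. First, check that $r(N(u)) \le 0$ for every $u \in S$. Therefore, $\lambda_u \le 1$. Now, check that if $u, uv \in S$ then $v = \epsilon$ or $N(u) \cap N(uv) = \emptyset$. Let $u = x_1 \ldots x_n \in S$ where $x_1, \ldots, x_n \in \Sigma$ and let $u_0 = \epsilon$ and $u_i = u_{i-1} x_i$ for $1 \le i \le n$. We have

$$\lambda_u = \prod_{i=0}^{n} \frac{r(u_i\Sigma^*)}{r(u_i\Sigma^*) - r(N(u_i))} = \prod_{i=0}^{n} \left( 1 - \frac{r(N(u_i))}{r(u_i\Sigma^*)} \right)^{-1}$$

and

$$\log \lambda_u = - \sum_{i=0}^{n} \log \left( 1 - \frac{r(N(u_i))}{r(u_i\Sigma^*)} \right) \ge \sum_{i=0}^{n} \frac{r(N(u_i))}{r(u_i\Sigma^*)}.$$

Since $r(u_i\Sigma^*) \le \bar{r}$, $\log \lambda_u \ge \sum_{i=0}^{n} r(N(u_i)) / \bar{r} = r(\cup_{i=0}^{n} N(u_i)) / \bar{r} \ge r(N) / \bar{r}$. Therefore, $\lambda_u \ge e^{r(N)/\bar{r}}$. □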

Let $p_r$ be the series defined by: $p_r(u) = 0$ if $u \in N$ and $p_r(u) = \lambda_u r(u)$ otherwise. We show that $p_r$ is a stochastic language.



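Lemma 7.
- $p_r(\epsilon) + \lambda_\epsilon \sum_{x \in S \cap \Sigma} r(x\Sigma^*) = 1$,
- For any $u \in \Sigma^*$ and any $x \in \Sigma$, if $ux \in S$ then

$$p_r(ux) + \lambda_{ux} \sum_{\{y \in \Sigma : uxy \in S\}} r(uxy\Sigma^*) = \lambda_u r(ux\Sigma^*).$$

Proof. First, check that for every $u \in S$,

$$p_r(u) + \lambda_u \sum_{x \in u^{-1}S \cap \Sigma} r(ux\Sigma^*) = \lambda_u (r(u\Sigma^*) - r(N(u))).$$

Then, $p_r(\epsilon) + \lambda_\epsilon \sum_{x \in S \cap \Sigma} r(x\Sigma^*) = \lambda_\epsilon (1 - r(N(\epsilon))) = 1$. Now, let $u \in \Sigma^*$ and $x \in \Sigma$ s.t. $ux \in S$; then $p_r(ux) + \lambda_{ux} \sum_{\{y \in \Sigma : uxy \in S\}} r(uxy\Sigma^*) = \lambda_{ux} (r(ux\Sigma^*) - r(N(ux))) = \lambda_u r(ux\Sigma^*)$. □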

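Lemma 8. Let $Q$ be a prefixial finite subset of $\Sigma^*$ and let $Q_S = (Q\Sigma \setminus Q) \cap S$. Then

$$p_r(Q) = 1 - \sum_{ux \in Q_S, \, x \in \Sigma} \lambda_u r(ux\Sigma^*).$$

Proof. By induction on $Q$. When $Q = \{\epsilon\}$, the relation comes directly from Lemma 7. Now, suppose that the relation is true for a prefixial subset $Q'$, let $u_0 \in Q'$ and $x_0 \in \Sigma$ be such that $u_0 x_0 \notin Q'$ and let $Q = Q' \cup \{u_0 x_0\}$. We have

$$p_r(Q) = p_r(Q') + p_r(u_0 x_0) = 1 - \sum_{ux \in Q'_S, \, x \in \Sigma} \lambda_u r(ux\Sigma^*) + p_r(u_0 x_0)$$

where $Q'_S = (Q'\Sigma \setminus Q') \cap S$, from the inductive hypothesis.

If $u_0 x_0 \notin S$, check that $p_r(u_0 x_0) = 0$ and that $Q_S = Q'_S$. Therefore, $p_r(Q) = 1 - \sum_{ux \in Q_S, \, x \in \Sigma} \lambda_u r(ux\Sigma^*)$.

If $u_0 x_0 \in S$, then $Q_S = Q'_S \setminus \{u_0 x_0\} \cup (u_0 x_0 \Sigma \cap S)$. Therefore,

$$p_r(Q) = 1 - \sum_{ux \in Q'_S, \, x \in \Sigma} \lambda_u r(ux\Sigma^*) + p_r(u_0 x_0)$$

$$= 1 - \sum_{ux \in Q'_S \setminus \{u_0 x_0\}, \, x \in \Sigma} \lambda_u r(ux\Sigma^*) - \lambda_{u_0 x_0} \sum_{\{y \in \Sigma : u_0 x_0 y \in S\}} r(u_0 x_0 y \Sigma^*)$$

$$= 1 - \sum_{ux \in Q_S, \, x \in \Sigma} \lambda_u r(ux\Sigma^*) \quad \text{from Lemma 7.}$$ □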

Proposition 7. Let $r$ be a formal series over $\Sigma$ such that $\sum_{w \in \Sigma^*} r(w)$ converges absolutely to 1. Then, $p_r$ is a stochastic language such that for every $u \in \Sigma^* \setminus N$,

$$(1 + r(N)/\bar{r}) \, r(u) \le e^{r(N)/\bar{r}} \, r(u) \le p_r(u) \le r(u).$$

Proof. From Lemma 6, the only thing that remains to be proved is that $p_r$ is a stochastic language. Clearly, $p_r(u) \in [0, 1]$ for every word $u$. From Lemma 8, for any integer $k$,

$$|1 - p_r(\Sigma^{\le k})| \le \sum_{u \in \Sigma^{k+1} \cap S} r(u\Sigma^*) \le \sum_{w \in \Sigma^{>k}} |r(w)|$$

which tends to 0 since $r$ is absolutely convergent. □
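The construction of $p_r$ can be followed step by step on a finite-support series. The Python sketch below is only an illustration: the series `R` is a hypothetical example chosen here (it sums to 1 and takes two negative values), and the helper names `cone`, `neg_part` and `build_p` are assumptions of this sketch. It explores the set $S$ from $\epsilon$, computes $r(N(u))$ and $\lambda_u$ for each $u \in S$, and returns the values of $p_r$ on $S$.

```python
from fractions import Fraction as F

SIGMA = "ab"
# Hypothetical finite-support series r: it sums to 1 and has two negative values.
R = {"": F(1, 2), "a": F(1, 2), "b": F(-1, 4), "aa": F(1, 2), "ab": F(-1, 4)}


def cone(u):
    """r(u Sigma^*): total weight of the words having u as a prefix."""
    return sum(v for w, v in R.items() if w.startswith(u))


def neg_part(u):
    """r(N(u)): r(u) if it is <= 0, plus every cone r(ux Sigma^*) that is <= 0."""
    total = min(R.get(u, F(0)), F(0))
    for x in SIGMA:
        total += min(cone(u + x), F(0))
    return total


def build_p():
    """Return p_r restricted to S as a dict (p_r is 0 on every word outside S)."""
    p = {}
    # The stack holds pairs (u, lambda_u) for words u in S, starting from epsilon.
    stack = [("", 1 / (1 - neg_part("")))]
    while stack:
        u, lam = stack.pop()
        # p_r(u) = lambda_u * r(u) if r(u) > 0, and 0 if u belongs to N.
        p[u] = lam * max(R.get(u, F(0)), F(0))
        for x in SIGMA:
            ux = u + x
            if cone(ux) > 0:  # ux belongs to S
                lam_ux = lam * cone(ux) / (cone(ux) - neg_part(ux))
                stack.append((ux, lam_ux))
    return p
```

Here $S = \{\epsilon, a, aa\}$, with $\lambda_\epsilon = 4/5$ and $\lambda_a = \lambda_{aa} = 3/5$. The negative weight of $b$ and $ab$ is removed and the remaining values are rescaled so that $p_r$ sums to 1 while staying below $r$ on $S$.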



To sum up, DEES computes MA $A$ whose structure is equal to the structure of the target from some step on, and whose parameters tend reasonably fast to the true parameters. From some step on, they define rational series $r_A$ which converge absolutely to 1. By using these MA, it is possible to efficiently compute $p_{r_A}(u)$ or $p_{r_A}(u\Sigma^*)$ for any word $u$. Moreover, since $r_A$ converges absolutely and since $A$ tends to the target, the weight $r_A(N)$ of the negative values tends to 0 and $p_{r_A}$ converges to the target.
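One possible way to compute $p_{r_A}(u)$ directly from an MA is sketched below in Python. This is an assumption-laden illustration, not the procedure used by DEES: the two-state MA is hypothetical, it is assumed to satisfy condition (3) with $r_{A,q}(\Sigma^*) = 1$ for every state $q$ (so that $r_A(u\Sigma^*)$ is the sum of the entries of $\iota_A \varphi_A(u)$), and the names `step`, `p_r` and `mass_up_to` are invented here. The function walks along the prefixes of the word and updates $\lambda$ as in the definition above.

```python
from fractions import Fraction as F
from itertools import product

SIGMA = "ab"
# Hypothetical 2-state MA satisfying condition (3); the weights are illustrative only.
PHI = {
    "a": [[F(3, 10), F(2, 10)], [F(0), F(4, 10)]],
    "b": [[F(1, 10), F(-1, 10)], [F(2, 10), F(1, 10)]],
}
IOTA = [F(6, 5), F(-1, 5)]
TAU = [F(1, 2), F(3, 10)]


def step(vec, x):
    """Row vector vec * phi(x)."""
    n = len(vec)
    return [sum(vec[i] * PHI[x][i][j] for i in range(n)) for j in range(n)]


def p_r(word):
    """p_r(word), computed along the prefixes of word."""
    vec, lam = IOTA, F(1)  # vec = iota * phi(prefix); lam = lambda of the parent prefix
    for pos in range(len(word) + 1):
        cone = sum(vec)  # r(prefix Sigma^*), assuming r_q(Sigma^*) = 1 for every state q
        if cone <= 0:    # the prefix is not in S, hence word is in N
            return F(0)
        point = sum(v * t for v, t in zip(vec, TAU))  # r(prefix)
        # r(N(prefix)): non-positive r(prefix) plus non-positive cones of its successors.
        neg = min(point, 0) + sum(min(sum(step(vec, x)), 0) for x in SIGMA)
        lam = lam * cone / (cone - neg)  # lambda of the current prefix
        if pos == len(word):
            return lam * max(point, 0)
        vec = step(vec, word[pos])


def mass_up_to(max_len):
    """Sum of p_r(w) over all words w with |w| <= max_len."""
    return sum(p_r("".join(w))
               for k in range(max_len + 1)
               for w in product(SIGMA, repeat=k))
```

Each call to `p_r` costs a number of vector-matrix products linear in the length of the word. On this toy automaton, `mass_up_to(K)` increases towards 1 as $K$ grows, as expected for a stochastic language.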

5 Conclusion 

We have defined an inference algorithm, DEES, designed to learn rational stochastic languages, a class which strictly contains the class of stochastic languages computable by PA (or HMM). We have shown that the class of rational stochastic languages over $\mathbb{Q}$ is strongly identifiable in the limit. Moreover, DEES is an efficient inference algorithm which can be used in practical cases of grammatical inference. The experiments we have already carried out confirm the theoretical results of this paper: the fact that DEES aims at building a natural and minimal representation of the target provides a very significant improvement over the results obtained by classical probabilistic inference algorithms.